3  Introduction to Tidymodels

“I think umpires have too much power, without any system of checks and balances and the more money a player makes, the more the umpire tries to show off that power to him. Unfortunately, since I signed my contract my strike zone has suddenly become a lot larger.” - Ozzie Smith

Ozzie Smith performing his signature backflip during the 1985 World Series, when the Cardinals had a 3-1 series lead. The Cardinals would go on to lose the World Series after the Kansas City Royals won the last three games.

In sports analytics, one of the most powerful tools we have for building predictive models is the tidymodels framework. This collection of R packages is designed to streamline the process of building, evaluating, and deploying machine learning models. In this section, we will introduce the core components of tidymodels, which include data preprocessing, model specification, training, evaluation, and tuning.

The tidymodels framework leverages the philosophy of tidy data and the consistent approach of the tidyverse, making it intuitive and easy to use. By adhering to this unified structure, you can integrate modeling workflows seamlessly with data wrangling operations like those we’ve covered in previous chapters. This makes tidymodels especially suited for analyzing sports data, where the goal is often to generate models that can predict player performance, team outcomes, or game results.

3.1 The tidymodels Framework

The tidymodels framework brings together the principles of tidy data and the tidyverse, making it easy to integrate data preprocessing, model training, and evaluation into a unified workflow. Below, we’ll explore the core components of the tidymodels framework, each of which plays a vital role in the modeling process.
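All of the packages discussed below can be loaded individually, but the tidymodels meta-package loads the core set in a single call:

```r
# Loads the core tidymodels packages, including recipes, parsnip,
# workflows, tune, dials, yardstick, and rsample
library(tidymodels)
```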

3.2 recipes

In sports analytics, the data we work with often needs to be cleaned and transformed before it can be used for modeling. The recipes package in tidymodels is designed to help automate and streamline this process. A recipe is a blueprint for data preprocessing, allowing you to specify how to transform raw data into a form suitable for modeling. This can include tasks like normalizing numerical variables, encoding categorical variables, handling missing values, and creating new features.

3.2.1 The Core Structure of a Recipe

A recipe starts by defining a formula, typically in the form of a dependent variable (outcome) and independent variables (predictors), followed by one or more step_ functions that specify the transformations to be applied to the data. These steps are applied in the order they are specified in the recipe, ensuring that your preprocessing is both consistent and repeatable.

library(recipes)

# Define the recipe
recipe_obj = recipe(outcome ~ ., data = data) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_factor_predictors())

In this example, we define a recipe where outcome is the dependent variable and all other variables are predictors. We first normalize all numeric predictors using step_normalize(), and then create dummy variables for all factor predictors using step_dummy().

3.2.2 Common step_ Functions

The recipes package includes several commonly used step_ functions to handle different data transformation tasks. Here are a few of the most frequently used steps:

step_normalize()

The step_normalize() function is used to standardize numeric predictors by scaling them so that they have a mean of 0 and a standard deviation of 1. This is particularly useful when predictors vary greatly in scale, such as when comparing different player statistics like points, assists, and rebounds. Normalization ensures that all variables contribute equally to the model.

recipe_obj = recipe(outcome ~ points + assists + rebounds, data = player_data) |>
  step_normalize(points, assists, rebounds)

In this example, we apply normalization to the points, assists, and rebounds columns in the player_data dataset.

step_dummy()

The step_dummy() function is used to convert categorical variables (factors) into dummy (binary) variables, which are required for most machine learning models. For example, if you have a variable like team that indicates the team a player belongs to, step_dummy() will create separate columns for each team.

recipe_obj = recipe(outcome ~ team + points, data = player_data) |>
  step_dummy(team)

Here, step_dummy() will create a binary indicator column for each team other than a reference level (e.g., team_B, team_C, and so on), with a value of 1 if the player is on that team and 0 otherwise. By default one level is dropped to avoid redundancy; set one_hot = TRUE if you want a column for every team.

step_naomit()

The step_naomit() function is used to remove rows with missing values. In sports datasets, it’s common to encounter missing data (e.g., a player’s stats might not be available for all games), and this function can help clean the data by excluding those rows. This step is often used when missing data can’t be easily imputed or when the model cannot handle missing values.

recipe_obj = recipe(outcome ~ points + assists, data = player_data) |>
  step_naomit(points, assists)

This example removes rows where the points or assists columns contain missing values.

step_center()

The step_center() function centers numeric variables by subtracting the mean of each variable from the values in that variable. This can be useful when you want to ensure that all variables are centered around zero, particularly when working with models that are sensitive to the scale of the data, such as linear regression or neural networks.

recipe_obj = recipe(outcome ~ points + rebounds, data = player_data) |>
  step_center(points, rebounds)

Here, we center the points and rebounds columns in the player_data dataset.

step_interact()

The step_interact() function creates interaction terms between two or more variables. Interaction terms can be important in sports analytics, where the effect of one variable (e.g., player experience) might depend on the value of another variable (e.g., team performance). This step allows you to automatically create these interaction terms.

recipe_obj = recipe(outcome ~ points + experience + team_performance, data = player_data) |>
  step_interact(~ points:experience + points:team_performance)

This step creates interaction terms between points and experience, as well as between points and team_performance.

step_zv()

The step_zv() function removes any predictors with zero variance, meaning those variables that do not vary at all across the data (e.g., a column where all the values are the same). These variables do not provide any useful information for the model and can safely be removed.

recipe_obj = recipe(outcome ~ ., data = player_data) |>
  step_zv(all_predictors())

Here, step_zv() will remove any predictor columns from the player_data dataset that have zero variance.

3.2.3 Combining Steps

You can combine multiple step_ functions in a single recipe to handle a variety of preprocessing tasks. This ensures that all the necessary transformations are applied consistently across both training and testing datasets.

recipe_obj = recipe(outcome ~ points + assists + rebounds + team, data = player_data) |>
  step_normalize(points, assists, rebounds) |>
  step_dummy(team) |>
  step_naomit(points, assists, rebounds)

In this example, the recipe first normalizes the numeric variables (points, assists, rebounds), then creates dummy variables for the team variable, and finally removes rows with missing values in the specified columns.

3.2.4 Applying the Recipe

Once a recipe is defined, it is prepared with the prep() function, which estimates any statistics the steps need (such as means and standard deviations) from the training data. The prepared recipe can then be used to bake (i.e., apply the preprocessing steps to) both the training and test datasets:

prepped_recipe = prep(recipe_obj, training = player_data)

# Apply the transformations: new_data = NULL returns the processed
# training data; in practice, pass a held-out test set to the second call
train_data = bake(prepped_recipe, new_data = NULL)
test_data = bake(prepped_recipe, new_data = player_data)
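In a real analysis the recipe should be prepped on the training data only, so that statistics such as means and standard deviations are never estimated from the test set. A minimal sketch using rsample's initial_split(), with toy data standing in for player_data:

```r
library(rsample)
library(recipes)

# Toy data standing in for player_data
set.seed(42)
player_data = data.frame(
  outcome = rnorm(100),
  points = rnorm(100, 20, 5),
  assists = rnorm(100, 5, 2)
)

recipe_obj = recipe(outcome ~ points + assists, data = player_data) |>
  step_normalize(points, assists)

# Hold out 25% of the rows before any preprocessing
split = initial_split(player_data, prop = 0.75)
player_train = training(split)
player_test = testing(split)

# Prep on the training data only, then bake both sets
prepped_recipe = prep(recipe_obj, training = player_train)
train_data = bake(prepped_recipe, new_data = NULL)
test_data = bake(prepped_recipe, new_data = player_test)
```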

3.3 parsnip

The parsnip package in tidymodels provides a unified interface for specifying machine learning models in R. It allows you to define the type of model you want to fit (such as linear regression, decision trees, or random forests) without worrying about the underlying implementation details. This consistency across different model types makes it easy to switch between models or experiment with multiple models within the same framework.

The power of parsnip lies in its simplicity and flexibility. You only need to specify the model type and the engine (the underlying R package or algorithm used to fit the model), and parsnip handles the rest. This approach not only simplifies the modeling process but also allows you to experiment with different modeling techniques using the same set of tools and functions.

3.3.1 Specifying Models with parsnip

When using parsnip, you first define the type of model you want to create. This is done using a function for the specific model, such as linear_reg() for linear regression or rand_forest() for random forests. Once the model type is specified, you set the engine, which determines how the model is fitted (e.g., using lm for linear regression or rpart for decision trees). The next step is to fit the model to the data.

Here’s an example of how to specify and fit a linear regression model:

library(parsnip)

# Specify a linear regression model
model = linear_reg() |>
  set_engine("lm")

# Fit the model to the data
trained_model = model |>
  fit(outcome ~ points + assists + rebounds, data = player_data)

In this example, linear_reg() specifies a linear regression model, and set_engine("lm") tells parsnip to use the lm engine to fit the model. The fit() function then trains the model using the player_data dataset, predicting the outcome variable from the points, assists, and rebounds predictors.
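Once fitted, the model generates predictions with predict(), which returns a tibble with a .pred column that lines up row-for-row with the new data. A self-contained sketch with toy data standing in for player_data:

```r
library(parsnip)

# Toy data standing in for player_data
set.seed(42)
player_data = data.frame(
  outcome = rnorm(50),
  points = rnorm(50, 20, 5),
  assists = rnorm(50, 5, 2),
  rebounds = rnorm(50, 8, 3)
)

trained_model = linear_reg() |>
  set_engine("lm") |>
  fit(outcome ~ points + assists + rebounds, data = player_data)

# predict() returns a tibble with a single .pred column
preds = predict(trained_model, new_data = player_data)
```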

3.3.2 Common Models in parsnip

Here are some of the most common models you can specify using parsnip:

Linear Regression (linear_reg)

Linear regression is one of the simplest and most commonly used models for predicting a continuous outcome variable. In parsnip, the function linear_reg() is used to specify this model, and you can choose the engine (e.g., lm for the standard linear regression implementation or glmnet for regularized regression).

# Linear regression model
model_lr = linear_reg() |>
  set_engine("lm")

# Fit the model
trained_model_lr = model_lr |>
  fit(outcome ~ points + assists + rebounds, data = player_data)

Logistic Regression (logistic_reg)

Logistic regression is commonly used when the outcome variable is binary (e.g., win/loss, success/failure). In parsnip, logistic regression is specified using logistic_reg() and can be fitted using engines such as glm or spark.

# Logistic regression model
model_logistic = logistic_reg() |>
  set_engine("glm")

# Fit the model
trained_model_logistic = model_logistic |>
  fit(outcome ~ points + assists, data = player_data)

Random Forest (rand_forest)

Random forests are a powerful ensemble learning method that can be used for both classification and regression. In parsnip, you can specify a random forest model using the rand_forest() function and choose the engine (e.g., ranger or randomForest). Because the model supports both tasks, you must also declare the mode with set_mode().

# Random forest model
model_rf = rand_forest() |>
  set_engine("ranger") |>
  set_mode("regression")

# Fit the model
trained_model_rf = model_rf |>
  fit(outcome ~ points + assists + rebounds, data = player_data)

Decision Trees (decision_tree)

Decision trees are a non-linear model that can be used for both classification and regression tasks. In parsnip, decision trees are specified with decision_tree(), with rpart as the most common engine (C5.0 is also available for classification).

# Decision tree model
model_tree = decision_tree() |>
  set_engine("rpart") |>
  set_mode("regression")

# Fit the model
trained_model_tree = model_tree |>
  fit(outcome ~ points + assists, data = player_data)

Support Vector Machines (svm_rbf)

Support Vector Machines (SVM) are used for both classification and regression tasks and are particularly effective in high-dimensional spaces. In parsnip, you can specify an SVM with a radial basis function kernel using svm_rbf(), fitted with the kernlab engine.

# SVM model with radial basis function kernel
model_svm = svm_rbf() |>
  set_engine("kernlab") |>
  set_mode("regression")

# Fit the model
trained_model_svm = model_svm |>
  fit(outcome ~ points + assists + rebounds, data = player_data)

Boosted Trees (boost_tree)

Boosted trees, such as gradient boosting machines, are another powerful ensemble method that can be used for both classification and regression. In parsnip, you can specify a boosted tree model using boost_tree() and set the engine to xgboost (or lightgbm, via the bonsai extension package).

# Boosted tree model
model_boost = boost_tree() |>
  set_engine("xgboost") |>
  set_mode("regression")

# Fit the model
trained_model_boost = model_boost |>
  fit(outcome ~ points + assists + rebounds, data = player_data)

3.3.3 Tuning Models

Many models, such as random forests, SVMs, and boosted trees, have hyperparameters that can be tuned to improve model performance. parsnip provides a consistent interface for setting and tuning these hyperparameters. After specifying a model, you can use the tune package to perform grid search or random search to find the optimal parameters.

For example, to tune a random forest model, you can use the tune_grid() function to search over different values for the number of trees (trees) and the minimum number of data points required to split a node (min_n). The hyperparameters to be searched are marked with tune() in the model specification, and tune_grid() is given the specification along with a formula (or recipe) and a set of resamples:

library(tune)
library(dials)
library(rsample)

# Mark the hyperparameters to be tuned in the model specification
model_rf_tune = rand_forest(trees = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("regression")

# Define a grid of hyperparameters
grid = grid_regular(trees(range = c(100, 1000)), min_n(range = c(2, 20)), levels = 5)

# Tune the model
tuned_rf = tune_grid(
  model_rf_tune,
  outcome ~ points + assists + rebounds,
  resamples = vfold_cv(player_data),
  grid = grid
)

This code runs a grid search over the number of trees and the minimum node size to optimize the performance of the random forest model.
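After tuning, the tune package provides helpers for inspecting the results: show_best() ranks the candidate grid by a metric, and select_best() extracts the winning combination. A self-contained sketch using a decision tree (whose rpart engine ships with base R) so the example runs quickly; the data and parameter choices are illustrative only:

```r
library(tune)
library(dials)
library(rsample)
library(parsnip)

set.seed(42)
toy = data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

# Mark the cost-complexity penalty for tuning
tree_spec = decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

tuned = tune_grid(
  tree_spec,
  y ~ x1 + x2,
  resamples = vfold_cv(toy, v = 3),
  grid = grid_regular(cost_complexity(), levels = 3)
)

# Rank candidates by RMSE (lowest first) and extract the winner
show_best(tuned, metric = "rmse")
best_params = select_best(tuned, metric = "rmse")
```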

3.4 workflows

The workflows package in tidymodels is designed to streamline the process of building, training, and evaluating machine learning models by combining the different stages of the modeling process into a single, unified object. This object allows you to encapsulate the entire modeling pipeline, from preprocessing steps to model specification, into a single workflow. Using workflows ensures that your data preprocessing and modeling steps are tightly integrated and that the entire pipeline is reusable, reproducible, and easy to manage.

The power of workflows lies in its ability to combine preprocessing (via recipes) with model specification (via parsnip) into a cohesive object that can be easily trained, evaluated, and tuned. This simplifies the process of applying a series of steps to both training and testing data, making your model development process more efficient and organized.

3.4.1 The Structure of a Workflow

A workflow consists of two main components: a recipe and a model. The recipe defines the preprocessing steps (e.g., data normalization, encoding, etc.), while the model specifies the type of machine learning model to be fit (e.g., linear regression, random forest, etc.). Once the recipe and model are combined into a workflow, the workflow object can be used to fit, evaluate, and tune the model.

Here’s how you can create a simple workflow:

library(workflows)

# Define the recipe
recipe_obj = recipe(outcome ~ points + assists + rebounds + team, data = player_data) |>
  step_normalize(points, assists, rebounds) |>
  step_dummy(team)

# Define the model
model_obj = linear_reg() |>
  set_engine("lm")

# Create the workflow
workflow_obj = workflow() |>
  add_recipe(recipe_obj) |>
  add_model(model_obj)

# Fit the workflow to the data
fitted_workflow = workflow_obj |>
  fit(data = player_data)

In this example, we first define a recipe that normalizes the points, assists, and rebounds variables, as well as creates dummy variables for the team variable. We then specify a linear regression model using linear_reg(). Finally, we combine the recipe and model into a single workflow object using workflow(), and the fit() function is used to train the model on the player_data dataset.

3.4.2 Adding Recipes and Models to a Workflow

You can easily add both the recipe and the model to a workflow using the add_recipe() and add_model() functions, respectively. Once these components are added, the workflow object is ready for training, evaluation, and tuning.

# Add the recipe and model to the workflow
workflow_obj = workflow() |>
  add_recipe(recipe_obj) |>
  add_model(model_obj)

This step ensures that your workflow is built with the appropriate preprocessing steps and model, making it ready to be applied to your data.
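Workflows are also easy to modify in place: update_model() swaps the model while leaving the recipe untouched (update_recipe() does the reverse). A self-contained sketch with toy data:

```r
library(workflows)
library(parsnip)
library(recipes)

set.seed(42)
toy = data.frame(y = rnorm(30), x = rnorm(30))

wf = workflow() |>
  add_recipe(recipe(y ~ x, data = toy)) |>
  add_model(linear_reg() |> set_engine("lm"))

# Swap in a decision tree without rebuilding the recipe
wf_tree = wf |>
  update_model(decision_tree() |> set_engine("rpart") |> set_mode("regression"))
```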

3.4.3 Tuning a Workflow

One of the key benefits of using workflows is that you can integrate tuning directly into the workflow process. For instance, when tuning hyperparameters for models like random forests or support vector machines, you can pass the workflow to the tune_grid() function along with a grid of hyperparameters to optimize.

library(tune)

# A workflow needs a model with tunable parameters; here we use a
# random forest specification with trees and min_n marked for tuning
model_rf_tune = rand_forest(trees = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("regression")

workflow_rf = workflow() |>
  add_recipe(recipe_obj) |>
  add_model(model_rf_tune)

# Define a grid of hyperparameters
grid = grid_regular(trees(range = c(100, 1000)), min_n(range = c(2, 20)), levels = 5)

# Tune the workflow
tuned_workflow = tune_grid(
  workflow_rf,
  resamples = vfold_cv(player_data),
  grid = grid
)

In this example, we use tune_grid() to tune a random forest model by varying the number of trees (trees) and the minimum node size (min_n). The linear regression workflow from earlier has no hyperparameters to tune, so we first swap in a random forest specification. The grid search is performed over these hyperparameters using cross-validation, ensuring that the best-performing model configuration is selected.

3.4.4 Evaluating a Workflow

Once the workflow is trained, you can evaluate its performance using the yardstick package, which provides a consistent set of functions for model evaluation. For instance, after fitting the workflow, you can generate predictions on a test set and calculate metrics such as accuracy, RMSE, or AUC.

library(yardstick)
library(dplyr)

# Generate predictions and attach the observed outcomes
predictions = predict(fitted_workflow, new_data = test_data) |>
  bind_cols(test_data)

# Evaluate the performance (RMSE, since the outcome here is numeric)
rmse(predictions, truth = outcome, estimate = .pred)

Here, predict() generates predictions from the fitted workflow, and rmse() calculates the root mean squared error of those predictions on the test data. yardstick metric functions expect a data frame containing both the truth and estimate columns, which is why the predictions are first bound to the observed outcomes.
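When several metrics are needed at once, yardstick's metric_set() bundles them into a single function that returns one row per metric. A self-contained sketch with toy predictions:

```r
library(yardstick)

# Toy observed outcomes and model predictions
results = data.frame(
  truth = c(3.2, 1.8, 4.5, 2.9),
  .pred = c(3.0, 2.1, 4.2, 3.1)
)

# Bundle RMSE, MAE, and R-squared into one metric function
reg_metrics = metric_set(rmse, mae, rsq)
out = reg_metrics(results, truth = truth, estimate = .pred)
```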

3.4.5 Workflow and Resampling

Workflows can also be integrated with resampling methods to evaluate the model’s performance across different subsets of the data. For example, you can use rsample to perform cross-validation or bootstrapping and pass the resampled data to the workflow for evaluation.

library(rsample)

# Create a 5-fold cross-validation resampling object
cv_folds = vfold_cv(player_data, v = 5)

# Tune the workflow with cross-validation
tuned_workflow_cv = tune_grid(
  workflow_obj,
  resamples = cv_folds,
  grid = grid
)

This example uses vfold_cv() to create a cross-validation object, and the workflow is tuned using this resampling method to get a better estimate of the model’s performance across different data splits.
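The final step after tuning is usually to lock in the winning hyperparameters with finalize_workflow(), then fit on the training set and evaluate on the held-out test set in one step with last_fit(). A self-contained sketch with toy data; the split proportion and grid are illustrative only:

```r
library(tidymodels)

set.seed(42)
toy = data.frame(y = rnorm(120), x1 = rnorm(120), x2 = rnorm(120))
split = initial_split(toy, prop = 0.75)

tree_spec = decision_tree(cost_complexity = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

wf = workflow() |>
  add_formula(y ~ x1 + x2) |>
  add_model(tree_spec)

tuned = tune_grid(
  wf,
  resamples = vfold_cv(training(split), v = 3),
  grid = grid_regular(cost_complexity(), levels = 3)
)

# Lock in the best hyperparameters, then fit on the training set
# and evaluate on the test set in one step
final_wf = finalize_workflow(wf, select_best(tuned, metric = "rmse"))
final_fit = last_fit(final_wf, split)
metrics = collect_metrics(final_fit)
```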

3.5 Concept Quiz

  1. True or False: The recipes package is designed to automate data preprocessing.

  2. What is the purpose of the step_normalize() function in the recipes package?

  3. Which function in the recipes package creates binary columns for categorical variables?

  4. What is the purpose of the tune_grid() function in tidymodels?

  5. Which package in tidymodels integrates the preprocessing and model specification steps into a single object?

  6. How do you apply the preprocessing steps defined in a recipe to a dataset in tidymodels?

  7. True or False: The workflow() function in tidymodels is used to fit a model to data.